ABSTRACT 

Summary: Most cellular tasks are performed not by individual 
proteins, but by groups of functionally associated proteins, often 
referred to as modules. In a protein association network, modules 
appear as groups of densely interconnected nodes, also called 
communities or clusters. These modules often overlap with each 
other and form a network of their own, in which nodes (links) 
represent the modules (overlaps). We introduce CFinder, a fast 
program locating and visualizing overlapping, densely interconnected 
groups of nodes in undirected graphs, and allowing the user to easily 
navigate between the original graph and the web of these groups. 
We show that in gene (protein) association networks CFinder can be 
used to predict the function(s) of a single protein and to discover novel 
modules. CFinder is also very efficient for locating the cliques of large 
sparse graphs. 

Availability: CFinder (for Windows, Linux, and Macintosh) and its 
manual can be downloaded from http://angel.elte.hu/clustering . 
Contact: cfinder@angel.elte.hu 

1 INTRODUCTION 

High-throughput experimental techniques, e.g., protein-protein interaction 
(PPI) and mRNA expression methods, have greatly advanced our knowledge 
about the functioning of the cell. Gene (protein) association networks 
integrate the broadest possible set of evidence - including high-throughput 
data - on protein linkages: they provide an integrated list of binary 
interactions (von Mering et al., 2005; Salwinski et al., 2004) and allow the 
discovery of previously uncharacterised cellular systems (Date and Marcotte, 
2003). One major goal of current research efforts is to elucidate how the 
observed behaviours of an entire cell can be understood in terms of the 
interactions of its protein modules. To identify such modules, a common 
approach is to search for groups of densely interconnected nodes in the 
cell's protein association network (Bader and Hogue, 2003; Rives and 
Galitski, 2003). Note, however, that modules strongly overlap. According 
to the CYGD database (Güldener et al., 2005), in Saccharomyces cerevisiae 
the number of proteins in known protein complexes (modules where the 
participating proteins physically interact at the same time) vs. the sum of 
the sizes of these complexes is 2750/8932, i.e., a protein found in a complex 
belongs to about 3.2 complexes on average. Thus, most protein modules 
probably share many of their proteins with other modules. 

We introduce CFinder, a platform-independent, stand-alone application 
locating overlapping groups of densely interconnected nodes in graphs, and 
illustrate its use on the network of gene associations in the yeast genome. 
We decided to maintain CFinder as an independent program (as opposed to 
a package plugin), because it can be employed by potential users belonging 
to diverse fields including, in addition to bioinformatics, economics or 
sociology. 

*To whom correspondence should be addressed. 




Fig. 1. (colour online) Modules of the protein SmD2 in the DIP (Database 
of Interacting Proteins) "yeast core" data set as shown by the Vertices view 
of CFinder. The two modules are coloured blue and green. Overlaps, i.e., 
proteins and links participating in more than one module are red. Enlarged on 
the top are two special buttons enabling the navigation between the original 
network (a part of which is displayed) and the web of its modules. 



Generic graph visualisation and analysis programs (Batagelj and Mrvar, 
1998) are frequently used for the layout and structural analysis of networks. 
Recent bioinformatics software platforms (Shannon et al., 2003), on the 
other hand, enable the user to integrate many different types of data, e.g., 
PPI, expression levels, and annotation information. CFinder reads a list of 
binary interactions, performs a search for dense subgraphs (groups), and 
- unlike several currently used algorithms (Newman, 2004) - it allows 
for any node to belong to more than one group. Due to its algorithm 
and implementation, CFinder is efficient for networks with millions of 
nodes and, as a byproduct of its search, the full clique overlap matrix of 
the network is determined. Below we will show that in gene association 
networks CFinder's results can be used to predict novel modules and novel 
individual protein functions. 
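
To make the term "clique overlap matrix" concrete: each row and column stands for one clique, a diagonal entry is the size of that clique, and an off-diagonal entry is the number of nodes the two cliques share. The following is a minimal sketch under the assumption that the cliques are already available as sets of node labels; the function name `clique_overlap_matrix` and the toy labels are illustrative and not part of CFinder.

```python
def clique_overlap_matrix(cliques):
    """Return M where M[i][j] is the number of nodes shared by cliques i and j.

    Diagonal entries M[i][i] equal the size of clique i.
    """
    sets = [set(c) for c in cliques]
    return [[len(a & b) for b in sets] for a in sets]


# Toy example with hypothetical node labels.
cliques = [{"A", "B", "C", "D"}, {"B", "C", "D", "E"}, {"E", "F", "G"}]
overlap = clique_overlap_matrix(cliques)
```

In this toy case the first two cliques share three nodes, so they would count as adjacent 4-cliques in the sense defined in Section 2.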

2 OVERVIEW OF CFINDER 

The input of CFinder is a file containing strings and numbers ordered into 
three columns; in each row the first two strings correspond to the two end 
points of a link and the third item is the weight of this link. 
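
As an illustration of this input layout, the sketch below reads such three-column rows into a list of weighted links. It assumes whitespace-separated columns and silently skips rows that do not have exactly three fields; both are assumptions of this example, since the text does not specify the delimiter or error handling. The function name `read_links` and the sample rows are hypothetical.

```python
import io


def read_links(handle):
    """Parse rows of the form: node_a node_b weight (whitespace-separated)."""
    links = []
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # assumption of this sketch: skip blank or malformed rows
        node_a, node_b, weight = fields
        links.append((node_a, node_b, float(weight)))
    return links


# Hypothetical example rows; the weights are made up for illustration.
example = io.StringIO("Lsm1p Lsm2p 0.9\nLsm2p Lsm7p 0.8\n\nLsm1p Lsm7p 0.4\n")
links = read_links(example)
```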

Fig. 2. (colour online) (a) The network of modules mapped by CFinder in the DIP "yeast full" data set (k = 4). (b-d) In addition to locating known complexes, 
CFinder often groups together a known complex with one additional protein, allowing the improvement of the functional annotation of that protein (Msh2, 
Vps8). (e) Zooming into the network of modules and adding Gene Ontology (GO) annotation terms (i) produces a detailed and well-structured layout of the 
original network of proteins, (ii) provides characterisation for individual proteins (Eeb1, Rts3) and (iii) predicts new modules (dark blue and brown, see text). 

The computational core of CFinder was implemented in C++, while the 
visualisation and analysis components were written in Java. The search 
algorithm uses the Clique Percolation Method (CPM, see Derényi et al., 
2005) to locate the k-clique percolation clusters of the network that we 
interpret as modules. A k-clique is a complete subgraph on k nodes (k = 
3, 4, ...), and two k-cliques are said to be adjacent if they share exactly 
k - 1 nodes. A k-clique percolation cluster consists of (i) all nodes that can be 
reached via chains of adjacent k-cliques from each other and (ii) the links in 
these cliques. Note that larger values of k correspond to a higher stringency 
during the identification of dense groups and provide smaller groups with a 
higher density of links inside them. For both local and global analyses in a 
network, we suggest using the value of k (typically between 4 and 6) that 
provides the user with the richest group structure (see Palla et al., 2005). In 
the presence of link weights, CFinder can apply lower and upper cutoff values 
to keep only the set of connections judged to be significant by the user. 
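
The definitions above can be sketched in a few lines of Python. This is a brute-force illustration for very small graphs only, not CFinder's implementation: it enumerates every k-node subset, keeps the complete ones, and merges k-cliques that share k - 1 nodes with a union-find structure. The function names `filter_links` and `k_clique_communities`, the inclusive cutoff interval, and the toy links are assumptions made for this example.

```python
from itertools import combinations


def filter_links(links, lower, upper):
    """Keep links whose weight lies in [lower, upper] (inclusive, by assumption)."""
    return [(a, b) for a, b, w in links if lower <= w <= upper]


def k_clique_communities(edges, k):
    """Return the k-clique percolation clusters as a list of node sets."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Brute force: every k-node subset whose pairs are all linked is a k-clique.
    cliques = [
        frozenset(group)
        for group in combinations(sorted(neighbours), k)
        if all(v in neighbours[u] for u, v in combinations(group, 2))
    ]

    # Union-find over cliques: two k-cliques are adjacent if they share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A cluster is the union of the nodes of all cliques in one chain.
    clusters = {}
    for i, clique in enumerate(cliques):
        clusters.setdefault(find(i), set()).update(clique)
    return list(clusters.values())


# Toy weighted links (hypothetical); the weak A-F link is removed by the cutoff.
links = [
    ("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0),
    ("B", "D", 1.0), ("C", "D", 1.0),
    ("D", "E", 1.0), ("D", "F", 1.0), ("E", "F", 1.0),
    ("A", "F", 0.1),
]
edges = filter_links(links, lower=0.5, upper=1.0)
communities = k_clique_communities(edges, k=3)
```

With k = 3 the toy graph yields two clusters that overlap in node D, which shows how a node can belong to more than one group. The subset enumeration grows combinatorially with network size, so this sketch is only meant to clarify the definitions.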

The user interface of CFinder offers several views of the analysed network 
and its module structure. As an example, Fig. 1 shows the modules of 
the protein Pwp2 in the DIP "yeast core" network (Salwinski et al., 2004) 
at clique size k = 4. Alternative views currently available in CFinder 
are "Communities" (displaying the identified modules), "Cliques", "Stats" 
(statistics of, e.g., module and overlap sizes) and "Graph of communities". 
The special buttons "forward", "back", "zoom" and "walk" allow a quick 
navigation between the views. A wide variety of visualisation settings can 
be adjusted in the "Tools" menu. 

Figure 2 displays the network of modules produced by CFinder (k = 4) in 
the DIP "yeast full" data set. In the complete map (a) each node represents 
a module, the area of a node is proportional to the number of proteins in 
the corresponding module, and the width of a link is proportional to the 
number of proteins shared by the two modules. Panel (b) shows a previously 
known complex identified by CFinder. Panels (c) and (d) both display a 
known complex grouped together with one additional protein (Msh2 and 
Vps8, respectively), leading to an improved functional annotation of that 
protein. In panel (e) Eeb1 (function currently unknown) is grouped together 
with proteins participating in vesicle-mediated transport; we therefore predict 
this to be a key function of Eeb1. Proteins in the marked dark blue and 
brown groups of panel (e) cooperate on the establishment of cell polarity, 
a function performed by a total of 103 proteins in the cell. (Please see the 
colour figure online.) We anticipate that these two groups are biologically 
meaningful, novel modules within that larger set of 103 proteins. [Gene 
names and annotations were handled with Perl tools, e.g., GO::TermFinder 
(Boyle et al., 2004).] 
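
The quantities drawn in panel (a) of Figure 2 can be derived from a list of modules as in the following minimal sketch. It assumes the modules are given as sets of protein names; the function name `community_graph` and the toy member names are hypothetical, and no claim is made that CFinder computes the map this way.

```python
from itertools import combinations


def community_graph(communities):
    """Return (sizes, overlaps) describing the network of modules.

    sizes[i] is the number of nodes in community i (shown as node area);
    overlaps[(i, j)] is the number of shared nodes (shown as link width).
    """
    sets = [set(c) for c in communities]
    sizes = [len(s) for s in sets]
    overlaps = {}
    for i, j in combinations(range(len(sets)), 2):
        shared = len(sets[i] & sets[j])
        if shared > 0:
            overlaps[(i, j)] = shared  # link only modules that overlap
    return sizes, overlaps


# Toy communities with hypothetical member names.
communities = [{"A", "B", "C", "D"}, {"D", "E", "F"}, {"G", "H", "I"}]
sizes, overlaps = community_graph(communities)
```

Here the third community shares no members with the others and therefore appears as an isolated node in the network of modules.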